Chrysalis Semiconductor, a division of Chrysalis-ITS, is designing network security protocol processing ICs that, with associated software and printed-circuit boards, are enablers for e-commerce and for highly specialized transaction and communication-based markets. Chrysalis-ITS designs both the hardware and software for these systems.
The first in a series of Chrysalis-ITS network security processors, the Luna 340 was fully emulated on the Axis Xtreme system to enable system verification and to accelerate software development and system tuning. Not only did we get the improved performance we sought, but we also reduced design time
for subsequent ICs. In fact, we were able to let our software designers start working on the emulated version of the design four months before we got silicon prototypes back.
The Luna 340 utilizes multiple embedded RISC cores to create a complete SoC providing both asymmetric and symmetric cryptographic processing as well as associated network security protocol processing. The Luna 341 is a
companion processor (Fig. 1). Its high-speed modular exponentiation, encryption and hashing acceleration hardware is designed to work seamlessly with the Luna 340, providing acceleration of SSL-related cryptographic operations (RC4, MD5, SHA-1) via a 32-bit, 66-MHz PCI interface.
The Luna 340, a 5M-gate-equivalent ASIC design, was designed in partnership with Mosaid Technologies. It has more than 22 million transistors implemented in 0.25-micron technology. The Luna 340 contains five ARC RISC cores with associated memory, two 66-MHz PCI interfaces and a complex (256-bit) multiplier. The first spin of the device led to functional silicon, although the performance fell short of expectations, so two areas were identified for improvement.
First was the provision of a better hardware/software co-design environment; that's where Axis Systems' emulation capability came in. Second, as the security market becomes better defined, it is possible to implement dedicated hardware blocks instead of a microprocessor-based solution. The Luna 341, the first Chrysalis-ITS chip to utilize such blocks, uses a platform-based SoC design approach to produce a hierarchical, scalable, modular design that is a million-gate-equivalent ASIC. The encryption engines for modular exponentiation, RC4, MD5 and SHA-1 were all developed by
Chrysalis-ITS' IC team. Use of multiple clock domains allowed the team to achieve maximum performance for each encryption engine. The Luna 341 was fully prototyped on multiple FPGAs and is currently being converted to an ASIC.
Because the Luna 340 is processor-based, there is a significant amount of software to design and verify, including drivers, application code and more than 128 kbytes of embedded firmware. The Luna 340 control processor receives commands from the host and dispatches the tasks to its own symmetric processors or
the Luna 341's crypto-accelerator. These commands control high-level cryptographic operations flexibly and
programmably.
Our verification tool set included VHDL and Verilog
simulators and software development tools from ARC,
Axis Systems, Cadence and Mentor Graphics. The logic
simulators lacked the performance we needed to do system-level verification, so we added Axis' hardware system, based upon reconfigurable computing (RCC) technology. With performance that is orders of magnitude faster than that of standard logic simulation, it is the cornerstone of our verification effort. The RCC system is used by the entire design team.
RCC technology uses programmable devices to enable simulation acceleration, system emulation and hardware/software co-verification on a single platform.
It requires only one design database. You can set up the
RCC system to accelerate logic simulation, or you can
use the emulation capability to make it look like the device and drive it from the real target system.
To our hardware engineers, the RCC system works like the logic simulator to which they are accustomed (Fig. 2, page 42). Because it can instantly swap from software-based logic simulation to simulation acceleration and system emulation, our engineers were able to emulate at a very high speed and then
swap to logic simulation mode
for full debugging capabilities. To our software designers, it
looks like the actual hardware is
back from the fab. The system
allows them to work in their
preferred host environment. They can work with their standard development tools, including instruction set simulators
and source-level debuggers, instead of working with waveforms and large log files from
logic simulators. Software engineers can even single-step through their code, observing what happens to the
registers with each line of code.
The RCC technology accelerates logic simulation by
utilizing programmed coprocessors, where each processor is designed for a single task. The number and type
of processors selected during design compilation
depend on the design description. Then these hundreds of thousands of processors are mapped into
FPGAs and executed using an event-driven parallel-computing architecture with a highly efficient communication algorithm.
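To give a feel for what "event-driven" means here, the following is a minimal software sketch of event-driven evaluation in general. It is not a description of how the RCC hardware works internally: the function name `run_event_driven`, the dictionary-based netlist format and the signal names are all assumptions made for illustration, and the sketch only handles combinational logic without feedback loops.

```python
from collections import deque


def run_event_driven(gates, values, changed):
    """Propagate signal changes through a small combinational netlist.

    gates:   dict mapping output signal -> (function, [input signals])
    values:  dict mapping signal -> current value (0 or 1), updated in place
    changed: signals whose values were just modified by the caller
    Returns the number of gate evaluations performed.
    Assumes the netlist is acyclic (no combinational loops).
    """
    # Build a fanout table: which gate outputs depend on each signal.
    fanout = {}
    for out, (_, inputs) in gates.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(out)

    queue = deque(changed)
    evaluations = 0
    while queue:
        sig = queue.popleft()
        # Only gates driven by a changed signal are re-evaluated.
        for out in fanout.get(sig, []):
            fn, inputs = gates[out]
            new_value = fn(*(values[s] for s in inputs))
            evaluations += 1
            if new_value != values[out]:
                values[out] = new_value
                queue.append(out)  # the change becomes a new event
    return evaluations


# Hypothetical two-gate netlist: n1 = a AND b, y = n1 OR c
gates = {
    "n1": (lambda a, b: a & b, ["a", "b"]),
    "y": (lambda n1, c: n1 | c, ["n1", "c"]),
}
values = {"a": 0, "b": 1, "c": 0, "n1": 0, "y": 0}

values["a"] = 1
print(run_event_driven(gates, values, ["a"]), values["y"])  # prints: 2 1
```

The point of the sketch is that work is done only where something changes. A hardware implementation can additionally evaluate many such elements in parallel, which a sequential Python loop does not capture.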
When we started development on the Luna 340 network security chip, we did not yet have RCC emulation. The Luna 340 was designed using a combination of Verilog and VHDL. When we brought the Axis RCC system in
at the end of the design process for the Luna 340B, we
had already synthesized the design to gates and were
easily able to run the synthesized gate-level representation of the design on the RCC system. Within one week of
taking delivery of the unit, we were modeling the Luna
340B on the box. We used the RCC system to verify the
re-spin and to provide a model to exercise the Luna 341.
Tape-out on the Luna 340B was a success.
For the Luna 341 design, we used Verilog. That allowed us to run and debug the design on the Axis box in
RTL code. The RCC system supports behavioral, RTL and
gate-level descriptions.
Verification challenges
We deliver a considerable amount of specialized software with our devices, and it is a major component in
the overall system. With such a large software component, it is critical to use a verification environment
that offers a cycle-accurate representation of the device and supports software bring-up. We chose to use
the ARC cores for this chip set because of the level of
customization they permitted. Since the cores are extensible and the customers usually make significant
modifications, the vendors could not easily provide cycle-accurate models. So without an easy path
to emulation, we were forced to make a trade-off between adding extra functionality to get increased performance and having a good hardware/software
co-design environment.
When we added the RCC system, we had a simple
means to use our gate-level design to provide a cycle-accurate
model for the software engineers. So in addition to being
able to verify functionality, we
were now able to identify problems such as contention and
wait states.
Software written to complete
a function in 5,000 clocks could
turn out to take 8,000 clock cycles because of contention for
resources. If the IC model is not
cycle-accurate, it's very difficult
to identify the wait states and
hence predict what the performance will be. The RCC system is ideal for this situation.
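The effect is easy to reproduce in a toy model. The sketch below is purely illustrative and is not a model of the Luna 340: it assumes several cores sharing one single-ported resource, fixed-priority arbitration (lowest core index wins), and programs written as strings in which `c` is one cycle of private computation and `m` is one cycle that needs the shared resource. The function name `simulate_contention` is hypothetical.

```python
def simulate_contention(programs):
    """Cycle-by-cycle toy model of cores sharing one resource.

    programs: list of strings, one per core; 'c' = compute cycle,
              'm' = cycle that needs the shared resource.
    Returns (finish_cycles, wait_cycles), each a list with one entry per core.
    Arbitration is fixed priority: the lowest-numbered requester wins.
    """
    n = len(programs)
    pc = [0] * n        # next operation index for each core
    finish = [0] * n    # cycle in which each core completed its program
    waits = [0] * n     # wait states accumulated by each core
    cycle = 0
    while any(pc[i] < len(programs[i]) for i in range(n)):
        cycle += 1
        granted = False  # the resource serves one core per cycle
        for i in range(n):
            if pc[i] >= len(programs[i]):
                continue  # this core has already finished
            if programs[i][pc[i]] == "m":
                if granted:
                    waits[i] += 1  # stalled: resource busy this cycle
                    continue
                granted = True
            pc[i] += 1
            if pc[i] == len(programs[i]):
                finish[i] = cycle
    return finish, waits


finish, waits = simulate_contention(["cmcm", "mmmm"])
print(finish, waits)  # prints: [4, 6] [0, 2]
```

In this example the second core's program is four operations long, yet it needs six cycles because it loses arbitration twice. A functional model that ignores timing would report four, which is the kind of gap between expected and actual cycle counts described above.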
Since our five ARC cores were synthesizable and we had
netlists, we simply ran the design on the RCC system; we
no longer needed vendor-supplied models. When we run
the system in emulation mode, it becomes the cycle-accurate model we are looking for, and we can run actual
software on it. Because the Axis system is both a simulation accelerator and an emulator, it really is much more
than just a cycle-accurate model. It provides a complete
simulation and debug environment that is useful for
both hardware and software engineers.
The other major challenge we faced was the lack of
raw simulation performance. With logic simulation, we
simply didn't have the throughput we needed to adequately test the system. But the RCC system offered significantly enhanced performance (see table).
For one particular test, the hardware engineers took a
logic simulation that required two days to run and
ported it to the RCC system. Running in simulation acceleration mode, the same test took only 40 seconds.
That raw performance allowed the team to add more
tests to verify much more
complex behaviors than
we originally could do. This
is important when dealing
with a multichip system.
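To put the two-day versus 40-second comparison in numbers, here is a small worked calculation. It assumes "two days" means exactly 48 hours, which is only an approximation of the quoted figure, and the eight-hour overnight window used for the second number is a hypothetical value chosen for illustration.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_seconds / accelerated_seconds


def runs_per_window(window_seconds, run_seconds):
    """How many back-to-back runs of a given length fit in a time window."""
    return window_seconds // run_seconds


TWO_DAYS = 2 * 24 * 60 * 60       # 172,800 s, assuming exactly 48 hours
ACCELERATED = 40                  # seconds, from the test described above
NIGHT = 8 * 60 * 60               # hypothetical 8-hour overnight window

print(f"{speedup(TWO_DAYS, ACCELERATED):,.0f}x")   # prints: 4,320x
print(runs_per_window(NIGHT, ACCELERATED))          # prints: 720
```

Under those assumptions the test ran roughly 4,300 times faster, and a test that could not finish overnight in logic simulation could in principle be repeated hundreds of times in one night.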
Co-verification
Our design process has recently evolved to include
hardware/software co-verification. Because we were
able to run cycle-accurate
models of the hardware, we were able to dramatically increase the confidence in the entire design,
both the hardware and
software components. On
the original design of the
Luna 340, hardware/software co-verification was
not used. The design team
was able to achieve three
levels of coverage on the
software code. The designers loaded the device drivers, such as the PCI driver
and the firmware, and
were then able to add the
application code on top. When we put the whole
system together it worked, but not as fast as we
thought it would, as was
discussed earlier.
When the design team
ran real customer software
through the RCC emulator,
we found that with software-implementation
changes, we could reduce
the number of cycles
needed for various tasks.
We were also able to identify the bottlenecks in the
hardware, which helped us
to focus our design effort.
That became evident in
the design of the companion chip, where complex instructions that had taken 350
cycles in the first revision of the design could now be
completed in 200 cycles.
With the cycle-accurate model of the ARC, our designers can get a feel for no-op states, where the core is trying to do two things at once. The core may need to
finish one activity before it gets to another because of
resource contention. When we determine why it is calling on one device too often, we can change the software
to avoid that. We run the software again and know for
sure that the contention was removed.
The RCC system becomes an absolutely accurate
model of our chip. The system is then connected to the
board with a cable, so that we have a model for the entire system. We simply run the software on that. Some of
the software may be running on the RCC system, some
on the microprocessor on the board, or maybe a combination of the two. Since the hardware is still an RTL description, it is straightforward to make hardware modifications based on the results. That permits true
trade-off analysis and lets us change the hardware or
software to maximize system performance.
The biggest gain we see in our approach to hardware/software co-verification is having the software developed using known-good hardware. When we started the Luna 341 design, the RCC system was used to emulate the Luna 340B and exercise the software while the
Luna 340B was being manufactured. That gave us a four-month head start on the software development cycle for
the Luna 341; we didn't have to wait four months for actual silicon.
Hardware/software co-verification increases our confidence that a re-spin won't be needed from 30 percent
to more than 90 percent, which potentially represents a
big savings. Saving a re-spin on 0.15-micron technology
is equivalent to saving close to $1 million, not to mention
the four months needed to fabricate a design.
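One rough way to read those figures is as an expected cost. The calculation below is only a sketch: it assumes the stated confidence can be treated as the probability that no re-spin is needed, takes "more than 90 percent" as exactly 90, rounds the re-spin cost to $1 million, and ignores the schedule cost of the four-month delay. The function name is hypothetical.

```python
RESPIN_COST_USD = 1_000_000  # "close to $1 million", rounded for illustration


def expected_respin_cost(confidence_pct, respin_cost=RESPIN_COST_USD):
    """Expected re-spin cost in dollars.

    confidence_pct is interpreted (as an assumption) as the probability,
    in percent, that a re-spin will NOT be needed.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    return (100 - confidence_pct) * respin_cost // 100


before = expected_respin_cost(30)   # without co-verification
after = expected_respin_cost(90)    # with co-verification
print(before, after, before - after)  # prints: 700000 100000 600000
```

On that reading, co-verification lowers the expected re-spin cost per design by about $600,000, before counting the calendar time saved.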
An additional benefit we found in using the RCC system for hardware/software co-verification was the ability to force and release nodes as in a software environment.
environment. We used this to simulate sequences such
as power-up. In security devices, power-up sequences
are very extensive, since it is important to ensure that
sensitive information is not inadvertently disclosed. We
also added behavioral code in the form of checkers and
watchers. That allowed us to display messages such as
“Chip has been cleared” to the Unix command line
when a specific sequence of events triggered the
checker. We achieved much improved visibility into the
operation of the system with these features.
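The idea behind such a checker can be sketched in a few lines. The version below is a Python illustration of the concept only, not the behavioral code we ran on the RCC system. The class name `SequenceChecker` and the event names are hypothetical, and the matching logic is deliberately simple: it restarts on any out-of-order event instead of handling every overlapping pattern.

```python
class SequenceChecker:
    """Prints a message when a specific ordered sequence of events is seen."""

    def __init__(self, sequence, message):
        if not sequence:
            raise ValueError("sequence must contain at least one event")
        self.sequence = list(sequence)
        self.message = message
        self.position = 0   # how many events of the sequence have matched
        self.fired = 0      # how many times the full sequence was observed

    def observe(self, event):
        """Feed one event; return True if it completes the sequence."""
        if event == self.sequence[self.position]:
            self.position += 1
        else:
            # Out-of-order event: restart, allowing it to begin a new match.
            self.position = 1 if event == self.sequence[0] else 0
        if self.position == len(self.sequence):
            print(self.message)
            self.fired += 1
            self.position = 0
            return True
        return False


# Hypothetical power-up event names, chosen only for this example.
checker = SequenceChecker(
    ["reset_asserted", "keys_zeroized", "reset_released"],
    "Chip has been cleared",
)
for event in ["reset_asserted", "keys_zeroized", "reset_released"]:
    checker.observe(event)   # prints "Chip has been cleared" on the third event
```

A watcher is the same pattern without the pass/fail intent: it reports that something happened without judging whether it should have.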
RCC delivers the performance necessary for true
hardware/software co-verification, which enables early
testing of embedded software before prototypes are
available. It augments the verification done by hardware
engineers, leading to shorter project schedules and increased confidence in the completed design.
Debugging on an emulator
Most emulators lack advanced hardware debugging capabilities such as checkpointing. They cannot return to
any place in simulation time and continue interactive debugging. They also synthesize to gate-level representations, introducing ambiguity that is hard to overcome in
the debugging process.
RCC-based emulation does not have any of these limitations. It implements RTL directly on the programmed
coprocessors, so hardware verification engineers are debugging the code they wrote, not a synthesized representation of it. Because of the tight integration between
a software-based logic simulator and RCC, the engineers
can instantaneously swap between emulation and logic
simulation for unparalleled debug capabilities. Designers can return to any point in the simulation to view and
debug the design.
Our hardware verification engineers are finding that
the approach has huge advantages over traditional emulation solutions. They run regressions nightly and examine the results every morning. Since the logic
simulation won't tie up the emulator, we continue to
run real test cases while designers debug the previous
night's results.
The setup was simple: It took
less than one day to get the
Luna 341 design up and running
on the RCC system. And the
compile time was relatively
short: It took one hour to compile the entire system description into the box. Unlike most
emulators, the RCC system does
not require the use of a logic analyzer, further simplifying the
setup requirements. Instead of
setting probes to trigger at certain events, engineers can specify the time period and
depth of what they would like to see in the VCD results.
Increases in run-time performance enabled our designers to complete additional testing. The design team
was thrilled that these great gains in performance did
not come at the expense of debug capabilities.
The results
The Luna 340, a 5M-gate-equivalent ASIC design, was
fully emulated on the Axis Xtreme system to enable system verification and to accelerate software development
and system tuning.
Not only were we able to improve system performance through emulation-based HW/SW co-verification,
we also were able to shorten our design cycle. We were
able to let our software designers start working on the
emulated version of the design four months before we
got silicon prototypes back.
Now that we're at 0.15 micron, one iteration of silicon
would cost us four months and $1 million. Emulation
isn't optional anymore. The emulator is running all the
time, and there's a waiting list to use it.